reduce the buffer when using high dimensional data in distributed mode. #2485
Conversation
src/io/dataset_loader.cpp
Outdated
bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
  int total_num_feature = Network::GlobalSyncUpByMin(num_col);
  size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
This will still lead to overflow; the operands need to be cast before the result is assigned to the wider type.
Suggested change:
- size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
+ size_t esimate_sync_size = (size_t)BinMapper::SizeForSpecificBin(config_.max_bin) * (size_t)total_num_feature;
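For illustration, here is a minimal standalone sketch of why the cast matters (the per-feature size and feature count below are made-up values, not LightGBM's actual numbers): when both operands are `int`, the multiplication is performed in `int` and can overflow before the result is widened to `size_t`.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical stand-ins: ~2 KB of bin-mapper state per feature and
  // 20 million features (roughly the scale of kdda.t).
  int size_per_feature = 2048;       // stand-in for BinMapper::SizeForSpecificBin(max_bin)
  int total_num_feature = 20000000;  // stand-in for the synced feature count

  // Both operands are int, so the product is computed in int; 2048 * 20e6
  // far exceeds INT_MAX, which is signed overflow (undefined behavior,
  // typically observed as a wrapped/garbage value).
  size_t overflowed = size_per_feature * total_num_feature;

  // Casting the operands first makes the multiplication happen in size_t.
  size_t correct = static_cast<size_t>(size_per_feature) *
                   static_cast<size_t>(total_num_feature);

  std::printf("without cast: %zu\nwith cast:    %zu\n", overflowed, correct);
  return 0;
}
```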
src/io/dataset_loader.cpp
Outdated
bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
  int total_num_feature = Network::GlobalSyncUpByMin(dataset->num_total_features_);
  size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
Same as above, to avoid overflow.
Suggested change:
- size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
+ size_t esimate_sync_size = (size_t)BinMapper::SizeForSpecificBin(config_.max_bin) * (size_t)total_num_feature;
These changes broke something in distributed training; datasets that were fine before now also throw an MPI error on an
I found that the total number of bins differs between machines.
And the init_score is different too; it seems the datasets used on these machines are different.
That is indeed weird. My data and the LightGBM distribution lie on a shared NFS, so all nodes have access to the same data. As I said, the exact same dataset (avazu-app.val) trains to completion on the current master, so I think this PR has some kind of side effect on the data loading.
@thvasilo sorry, I realize there is one more place I need to fix too; I will update this soon.
Tried the current PR, but I'm wondering if this is a regression or expected behavior, since the same dataset trains fine on master. As before, the nodes show different numbers of bins and init values, which I guess is a bug?
@thvasilo did it throw the warning "Communication cost is too large for distributed dataset loading, using single mode instead."? |
If the warning exists, I think the error is caused by overflow.
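To make the fallback concrete, here is a rough, self-contained sketch of the kind of check this PR adds (the `kMaxSyncSize` threshold and the `SizeForSpecificBin` stub below are placeholders, not LightGBM's real values): the loader estimates how large the bin-mapper sync would be and, if it is too expensive, switches to single-machine bin finding and prints that warning.

```cpp
#include <cstddef>
#include <cstdio>

// Placeholder for BinMapper::SizeForSpecificBin; the real function returns
// the size of one bin mapper for the given max_bin.
size_t SizeForSpecificBin(int max_bin) { return 32 + 8 * static_cast<size_t>(max_bin); }

int main() {
  const int num_machines = 3;
  const int max_bin = 255;
  const int total_num_feature = 20000000;            // e.g. kdda.t has ~20M features
  const size_t kMaxSyncSize = 1024ULL * 1024 * 1024;  // hypothetical 1 GB budget

  bool force_findbin_in_single_machine = false;
  if (num_machines > 1) {
    // Cast before multiplying so the estimate cannot wrap around in int.
    size_t estimate_sync_size =
        SizeForSpecificBin(max_bin) * static_cast<size_t>(total_num_feature);
    if (estimate_sync_size > kMaxSyncSize) {
      std::printf("Communication cost is too large for distributed dataset "
                  "loading, using single mode instead.\n");
      force_findbin_in_single_machine = true;
    }
  }
  std::printf("force_findbin_in_single_machine = %s\n",
              force_findbin_in_single_machine ? "true" : "false");
  return 0;
}
```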
Unfortunately new errors popped up on the datasets I've tried. This is for avazu-app.t (1M features):
And this is for kdda.t (20M features):
Yeah, actually the master branch is okay, except the printed information is not right.
Thanks for the prompt fix! avazu-app.t trains fine now, still getting the same error for kdda.t though.
Thanks! I will find out what caused that check to fail.
@thvasilo I think the latest commit should fix the check error. Could you try it again? Thanks very much! |
Thanks @guolinke, it does indeed work now! One last question I have is about the init_score. Here's some example output for kdda:
and avazu-app.val:
The init_score is synced. You can see:
The last 3 lines are the synced scores on the 3 nodes.
Thanks @thvasilo very much! Now distributed mode can be run efficiently over sparse features too.
One thing I noticed, when setting
If I set the parameter to 2 (or other values > 1) this does not happen. Seems like a corner case :/ For me the current PR is good enough, just something to keep in mind. |
@thvasilo it is not trivial to locate what causes the failure. I added a possible fix, you can have a try 😄
Now it takes much longer, but I still get an error. I'd just note this as a known issue if it is to be included in the release; as you said, this would be hard to track down.
Thanks, I think I need to debug that. Can the failure be reproduced when running on a single node?
Running on a single node (no MPI parallel training) seems to be working fine. One question I had: when I set
@thvasilo yeah, it is expected behavior. |
@thvasilo could you try one more time? If it still fails, I don't think I can fix it for now.
@guolinke
Hello @guolinke, sorry I got caught up with other stuff. Can confirm that training on kdda.t (20M features) with
Example output:
to fix #2484